Algorithm 2 Discrete backpropagation via projection
Input: The training dataset; the full-precision kernels C; the projection matrix W; the learning rates η1 and η2.
Output: The binary or ternary PCNNs based on the updated C and W.
1: Initialize C and W randomly;
2: repeat
3:   // Forward propagation
4:   for l = 1 to L do
5:     Ĉ^l_{i,j} ← P(W, C^l_i); // using Eq. 3.43 (binary) or Eq. 3.59 (ternary)
6:     D^l_i ← Concatenate(Ĉ_{i,j}); // using Eq. 3.45
7:     Perform activation binarization; // using the sign function
8:     Perform traditional 2D convolution; // using Eqs. 3.46, 3.47, and 3.48
9:   end for
10:  Calculate the cross-entropy loss L_S;
11:  // Backward propagation
12:  Compute δ_{Ĉ^l_{i,j}} = ∂L_S / ∂Ĉ^l_{i,j};
13:  for l = L to 1 do
14:    // Calculate the gradients
15:    Calculate δ_{C^l_i}; // using Eqs. 3.49, 3.51, and 3.52
16:    Calculate δ_{W^l_j}; // using Eqs. 3.115, 3.116, and 3.56
17:    // Update the parameters
18:    C^l_i ← C^l_i − η1 δ_{C^l_i}; // Eq. 3.50
19:    W^l_j ← W^l_j − η2 δ_{W^l_j}; // Eq. 3.54
20:  end for
21:  Adjust the learning rates η1 and η2.
22: until the network converges
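To make the flow of Algorithm 2 concrete, the following is a minimal PyTorch sketch of one training step for a single binary PCNN layer. It is a sketch under simplifying assumptions, not the authors' implementation: the class names (SignSTE, ProjConv2d) and hyperparameters are illustrative, the concatenation over multiple projection matrices (Eq. 3.45) is omitted, and a clipped straight-through estimator stands in for the gradient computations of Eqs. 3.49–3.56.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """sign() in the forward pass, clipped straight-through gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)

class ProjConv2d(nn.Module):
    """One simplified PCNN layer: project W ∘ C onto {-1, +1}, binarize activations, convolve."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.C = nn.Parameter(0.1 * torch.randn(out_ch, in_ch, k, k))  # full-precision kernels C
        self.W = nn.Parameter(torch.ones(out_ch, in_ch, k, k))         # projection matrix W

    def forward(self, x):
        x_b = SignSTE.apply(x)                    # activation binarization (sign function)
        c_hat = SignSTE.apply(self.W * self.C)    # projected binary kernels (cf. Eq. 3.43)
        return F.conv2d(x_b, c_hat, padding=1)    # traditional 2D convolution

# two parameter groups give the separate learning rates eta1 (for C) and eta2 (for W)
conv = ProjConv2d(3, 16)
head = nn.Linear(16 * 32 * 32, 10)
opt = torch.optim.SGD([
    {"params": [conv.C], "lr": 1e-2},             # eta1
    {"params": [conv.W], "lr": 1e-3},             # eta2
    {"params": head.parameters(), "lr": 1e-2},
])

x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = F.cross_entropy(head(conv(x).flatten(1)), y)   # cross-entropy loss L_S
opt.zero_grad(); loss.backward(); opt.step()          # gradient-descent updates of C and W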
We believe that compressed ternary CNNs such as TTN [299] and TWN [130] provide better initialization states for binary CNNs. Theoretically, the performance of models with ternary weights is slightly better than that of models with binary weights, yet still far worse than that of real-valued models. Nevertheless, they provide an excellent initialization state for 1-bit CNNs in our proposed progressive optimization framework. Subsequent experiments show that PCNNs trained with the progressive optimization strategy perform better than those trained from scratch, and even better than ternary PCNNs trained from scratch.
The discrete set for ternary weights is a special case, defined as Ω := {a_1, a_2, a_3}. To be hardware friendly [130], we further require a_1 = −a_3 = Δ, as in Eq. 3.57, and a_2 = 0.
Regarding the threshold for ternary weights, we follow the choice made in [229] as

Δ^l = σ × E(|C^l|) ≈ (σ / I) Σ_{i}^{I} ||C^l_i||_1,    (3.58)
where σ is a constant factor shared by all layers. Note that [229] applies Eq. 3.58 to convolutional inputs or feature maps; we find it appropriate for convolutional weights as well. Consequently, we redefine the projection in Eq. 3.29 as
P_Ω(ω, x) = arg min_{a_i} ||ω ◦ x − 2a_i||, i ∈ {1, ..., U}.    (3.59)
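Assuming Ω = {−Δ, 0, +Δ} as above, the threshold of Eq. 3.58 and the projection of Eq. 3.59 can be sketched as follows; the function names and the default σ are illustrative choices, not fixed by the text, and only the forward projection is shown.

import torch

def ternary_threshold(C, sigma=0.7):
    # Delta^l = sigma * E(|C^l|)   (Eq. 3.58); computed here as the mean absolute entry of C^l
    return sigma * C.abs().mean()

def ternary_project(W, C, sigma=0.7):
    # P_Omega(omega, x) = argmin_{a_i} ||omega ∘ x − 2 a_i||, a_i in {-Delta, 0, +Delta} (Eq. 3.59)
    delta = ternary_threshold(C, sigma)
    x = W * C                                                          # omega ∘ x (element-wise)
    levels = torch.stack([-delta, torch.zeros_like(delta), delta])     # a_1, a_2, a_3
    dists = (x.unsqueeze(0) - 2.0 * levels.view(-1, *[1] * x.dim())).abs()
    return levels[dists.argmin(dim=0)]                                 # nearest discrete level per entry

# example: project a 16x3x3x3 kernel tensor with its projection matrix
C = torch.randn(16, 3, 3, 3)
W = torch.ones_like(C)
C_hat = ternary_project(W, C)        # values in {-Delta, 0, +Delta}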
In our proposed progressive optimization framework, the PCNNs with ternary weights (ternary PCNNs) are first trained from scratch and then serve as pre-trained models for progressively fine-tuning the PCNNs with binary weights (binary PCNNs).
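A rough sketch of this progressive pipeline follows; the builder and training helpers, epoch counts, and learning rates are hypothetical placeholders, not values from the text.

import copy

def progressive_training(build_pcnn, train):
    """Hypothetical driver: train a ternary PCNN from scratch, then reuse its
    weights to initialize and fine-tune the binary PCNN."""
    ternary_pcnn = build_pcnn(weight_mode="ternary")        # ternary PCNN
    train(ternary_pcnn, epochs=120, lr=1e-2)                # train from scratch

    binary_pcnn = build_pcnn(weight_mode="binary")          # same architecture, binary weights
    binary_pcnn.load_state_dict(copy.deepcopy(ternary_pcnn.state_dict()))  # ternary init
    train(binary_pcnn, epochs=60, lr=1e-3)                  # progressive fine-tuning
    return binary_pcnn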